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Being  Bayesian  About  Network  Structure 

A  Bayesian  Approach  to  Structure  Discovery  in  Bayesian  Networks 
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Jerusalem ,  91904 ,  Israel 

Daphne  Koller  (koller@cs.stanford.edu) 
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Stanford,  CA  94305-9010 

Abstract.  In  many  domains,  we  are  interested  in  analyzing  the  structure  of  the  underlying 
distribution,  e.g.,  whether  one  variable  is  a  direct  parent  of  the  other.  Bayesian  model  selection 
attempts  to  find  the  MAP  model  and  use  its  structure  to  answer  these  questions.  However, 
when  the  amount  of  available  data  is  modest,  there  might  be  many  models  that  have  non- 
negligible  posterior.  Thus,  we  want  compute  the  Bayesian  posterior  of  a  feature,  i.e.,  the  total 
posterior  probability  of  all  models  that  contain  it.  In  this  paper,  we  propose  a  new  approach 
for  this  task.  We  first  show  how  to  efficiently  compute  a  sum  over  the  exponential  number 
of  networks  that  are  consistent  with  a  fixed  order  over  network  variables.  This  allows  us  to 
compute,  for  a  given  order,  both  the  marginal  probability  of  the  data  and  the  posterior  of  a 
feature.  We  then  use  this  result  as  the  basis  for  an  algorithm  that  approximates  the  Bayesian 
posterior  of  a  feature.  Our  approach  uses  a  Markov  Chain  Monte  Carlo  (MCMC)  method, 
but  over  orders  rather  than  over  network  structures.  The  space  of  orders  is  smaller  and  more 
regular  than  the  space  of  structures,  and  has  much  a  smoother  posterior  “landscape”.  We 
present  empirical  results  on  synthetic  and  real-life  datasets  that  compare  our  approach  to  full 
model  averaging  (when  possible),  to  MCMC  over  network  structures,  and  to  a  non-Bayesian 
bootstrap  approach. 

Keywords:  Bayesian  Networks,  Structure  Learning,  MCMC,  Bayesian  Model  Averaging 
Abbreviations:  BN  -  Bayesian  Network;  MCMC  -  Markov  Chain  Monte  Carlo 


1.  Introduction 


In  the  last  decade  there  has  been  a  great  deal  of  research  focused  on  the  prob¬ 
lem  of  learning  Bayesian  networks  (BNs)  from  data  (Buntine,  1996;  Heck- 
erman,  1998).  An  obvious  motivation  for  this  problem  is  to  learn  a  model 
that  we  can  then  use  for  inference  or  decision  making,  as  a  substitute  for  a 
model  constructed  by  a  human  expert.  In  certain  cases,  however,  our  goal  is  to 
learn  a  model  of  the  system  not  for  prediction,  but  for  discovering  the  domain 
structure.  For  example,  we  might  want  to  use  Bayesian  network  learning  to 
understand  the  mechanisms  by  which  genes  in  a  cell  produce  proteins,  which 
in  turn  cause  other  genes  to  express  themselves,  or  prevent  them  from  doing 
so  (Friedman  et  al.,  2000).  In  this  case,  our  main  goal  is  to  discover  the 
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causal  and  dependence  relations  between  the  expression  levels  of  different 
genes  (Lander,  1999). 

The  most  common  approach  to  discovering  structure  is  to  use  learning 
with  model  selection  to  provide  us  with  a  single  high-scoring  model.  We  then 
use  that  model  (or  its  Markov  equivalence  class)  as  our  model  for  the  structure 
of  the  domain.  Indeed,  in  small  domains  with  a  substantial  amount  of  data,  it 
has  been  shown  that  the  highest  scoring  model  is  orders  of  magnitude  more 
likely  than  any  other  (Heckerman  et  al.,  1997).  In  such  cases,  the  use  of  model 
selection  is  a  good  approximation.  Unfortunately,  there  are  many  domains  of 
interest  where  this  situation  does  not  hold.  In  our  gene  expression  example,  it 
is  now  possible  to  measure  of  the  expression  levels  of  thousands  of  genes  in 
one  experiment  (Lander,  1999)  (where  each  gene  is  a  random  variable  in  our 
model  (Friedman  et  al.,  2000)),  but  we  typically  have  only  a  few  hundred  of 
experiments  (each  of  which  is  a  single  data  case).  In  cases,  like  this,  where 
the  amount  of  data  is  small  relative  to  the  size  of  the  model,  there  are  likely  to 
be  many  models  that  explain  the  data  reasonably  well.  Model  selection  makes 
a  somewhat  arbitrary  choice  between  these  models,  and  therefore  we  cannot 
be  confident  that  the  model  is  a  true  representation  of  the  underlying  process. 

Given  that  there  are  many  qualitatively  different  structures  that  are  ap¬ 
proximately  equally  good,  we  cannot  learn  a  unique  structure  from  the  data. 
Moreover,  in  many  learning  scenarios  there  are  exponentially  many  structures 
that  are  “reasonably”  good  given  the  data.  Thus,  enumerating  these  structures 
is  also  impractical.  However,  there  might  be  certain  features  of  the  domain, 
e.g.,  the  presence  of  certain  edges,  that  we  can  extract  reliably.  As  an  extreme 
example,  if  two  variables  are  very  strongly  correlated  (e.g.,  deterministically 
related  to  each  other),  it  is  likely  that  an  edge  between  them  will  appear  in 
any  high-scoring  model.  In  many  discovery  problems,  extracting  these  struc¬ 
tural  features  is  of  great  interest.  Bayesian  learning  allows  us  to  estimate  the 
strength  with  which  the  data  indicates  the  presence  of  a  certain  feature.  The 
Bayesian  score  of  a  model  is  simply  its  posterior  probability  given  the  data. 
Thus,  we  can  estimate  the  extent  to  which  a  feature,  e.g.,  the  presence  of  an 
edge,  is  likely  given  the  data  by  estimating  its  probability: 

P(f\D)  =  '£P(G\D)f(G),  (1) 

G 

where  f(G)  is  1  if  the  feature  holds  in  G  and  0  otherwise.  If  this  probability 
is  close  to  1,  then  almost  any  high-scoring  model  contains  the  feature  .  On 
the  other  hand,  if  the  probability  is  low,  we  know  that  the  feature  is  absent  in 
the  most  likely  models. 

The  number  of  BN  structures  is  super-exponential  in  the  number  of  ran¬ 
dom  variables  in  the  domain;  therefore,  this  summation  can  be  computed  in 
closed  form  only  for  very  small  domains,  or  those  in  which  we  have  additional 
constraints  that  restrict  the  space  (as  in  (Heckerman  et  al.,  1997)).  Altema- 


journal.tex;  18/07/2000;  2:56;  p.2 


Being  Bayesian  About  Network  Structure  3 

tively,  this  summation  can  be  approximated  by  considering  only  a  subset  of 
possible  structures.  Several  approximations  have  been  proposed  (Madigan 
and  Raftery,  1994;  Madigan  and  York,  1995).  One  theoretically  well-founded 
approach  is  to  use  Markov  Chain  Monte  Carlo  (MCMC)  methods:  we  define 
a  Markov  chain  over  structures  whose  stationary  distribution  is  the  posterior 
P(G  |  D),  we  then  generate  samples  from  this  chain,  and  use  them  to  estimate 
Eq.  (1).  This  approach  is  quite  popular,  and  variants  have  been  used  by  Madi¬ 
gan  and  York  (1995),  Madigan  et  al.  (1996),  Giudici  and  Green  (1999),  and 
Giudici  et  al.  (2000). 

In  this  paper,  we  propose  a  new  approach  for  evaluating  the  Bayesian 
posterior  probability  of  certain  structural  network  properties.  Our  approach 
is  based  on  two  main  ideas.  The  first  is  an  efficient  closed  form  equation  for 
summing  over  all  networks  with  at  most  k  parents  per  node  (for  some  constant 
k)  that  are  consistent  with  a  fixed  order  over  the  nodes.  This  equation  allows 
us  both  to  compute  the  overall  probability  of  the  data  for  this  set  of  network- 
s,  and  to  compute  the  posterior  probability  of  certain  structural  features  — 
edges  and  Markov  blankets —  over  this  set.  The  second  idea  is  the  use  of  an 
MCMC  approach,  but  over  orders  of  the  network  variables  rather  than  directly 
on  BN  structures. 

The  space  of  orders  is  much  smaller  than  the  space  of  network  structures; 
it  also  appears  to  be  much  less  peaked,  allowing  much  faster  mixing  (i.e., 
convergence  to  the  stationary  distribution  of  the  Markov  chain).  We  present 
empirical  results  illustrating  this  observation,  showing  that  our  approach  has 
substantial  advantages  over  direct  MCMC  over  BN  structures.  The  Markov 
chain  over  orders  mixes  much  faster  and  more  reliably  than  the  chain  over 
network  structures.  Indeed,  different  runs  of  MCMC  over  networks  typically 
lead  to  very  different  estimates  in  the  posterior  probabilities  of  structural  fea¬ 
tures,  illustrating  poor  convergence  to  the  stationary  distribution;  by  contrast, 
different  runs  of  MCMC  over  orders  converge  reliably  to  the  same  estimates. 
We  also  present  results  showing  that  our  approach  accurately  detects  dom¬ 
inant  features  even  with  sparse  data,  and  that  it  outperforms  both  MCMC 
over  structures  and  the  non-Bayesian  bootstrap  approach  of  Friedman  et  al. 
(1999a). 

2.  Bayesian  learning  of  Bayesian  networks 

2.1.  The  BayesiAn  learning  framework 

Consider  the  problem  of  analyzing  the  distribution  over  some  set  of  random 
variables  Xi, . . .  ,X„,  based  on  a  fully  observed  data  set  D  =  {x[l], . . .  ,x[M]}, 
where  each  x[j]  is  a  complete  assignment  to  the  variables  Xi, . . .  ,X„. 

The  Bayesian  learning  paradigm  tells  us  that  we  must  define  a  prior  prob¬ 
ability  distribution  P($)  over  the  space  of  possible  Bayesian  networks  eB. 
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This  prior  is  then  updated  using  Bayesian  conditioning  to  give  a  posterior 
distribution  P($  |  D)  over  this  space. 

For  Bayesian  networks,  the  description  of  a  model  %  has  two  components: 
the  structure  G  of  the  network,  and  the  values  of  the  numerical  parame¬ 
ters  0g  associated  with  it.  For  example,  in  a  discrete  Bayesian  network  of 
structure  G,  the  parameters  0g  define  a  multinomial  distribution  0Xi|u  for 
each  variable  X,-  and  each  assignment  of  values  u  to  PaG(X;).  If  we  consider 
Gaussian  Bayesian  networks  over  continuous  domains,  then  0x,  |u  contains  the 
coefficients  for  a  linear  combination  of  u  and  a  variance  parameter. 

To  define  the  prior  P(C8),  we  need  to  define  a  discrete  probability  distri¬ 
bution  over  graph  structures  G,  and  for  each  possible  graph  G,  to  define  a 
density  measure  over  possible  values  of  parameters  0g- 

The  prior  over  structures  is  usually  considered  the  less  important  of  the 
two  components.  Unlike  other  parts  of  the  posterior,  it  does  not  grow  as  the 
number  of  data  cases  grows.  Hence,  relatively  little  attention  has  been  paid 
to  the  choice  of  structure  prior,  and  a  simple  prior  is  often  chosen  largely  for 
pragmatic  reasons.  The  simplest  and  therefore  most  common  choice  is  a  uni¬ 
form  prior  over  structures  (Heckerman,  1998).  To  provide  a  greater  penalty 
to  dense  networks,  one  can  define  a  prior  using  a  probability  P  that  each  edge 
be  present;  then  networks  with  m  edges  have  prior  probability  proportional  to 

Pw(l  —  P)©~m  (Buntine,  1991).  An  alternative  prior,  and  the  one  we  use  in 
our  experiments,  considers  the  number  of  options  in  determining  the  families 
of  G.  Intuitively,  if  we  decide  that  a  node  X,  has  k  parents,  then  there  are  ("J1) 
possible  parents  sets.  If  we  assume  that  we  choose  uniformly  from  these,  we 
get  a  prior: 

Note  that  the  negative  logarithm  of  this  prior  corresponds  to  the  description 
length  of  specifying  the  parent  sets,  assuming  that  the  cardinality  of  these 
sets  are  known.  Thus,  we  implicitly  assume  that  cardinalities  of  parent  sets 
are  uniformly  distributed. 

A  key  property  of  all  these  priors  is  that  they  satisfy: 

—  Structure  modularity  The  prior  P(G)  can  be  written  in  the  form 

^(G)  =  np(*>Pa<;(X/))- 

i 

That  is,  the  prior  decomposes  into  a  product,  with  a  term  for  each  family 
in  G.  In  other  words  the  choices  of  the  families  for  the  different  nodes  are 
independent  a  priori. 

Next  we  consider  the  prior  over  parameters,  P(0g  |  G).  Here,  the  form  of 
the  prior  varies  depending  on  the  type  of  parametric  families  we  consider.  In 
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discrete  networks,  the  standard  assumption  is  a  Dirichlet  prior  over  0X.|„  for 
each  node  X,-  and  each  instantiation  11  to  its  parents  (Heckerman,  1998).  That 
is: 


P(Qg  I  G ) 
P(Qxt\u  I  G) 


n  ^fi«iG) 

i,u£Vfli(PaG(Xf)) 

r(aX.|")  If  Qa%  1“ 

iL,6vaWrK*iu)y jcttiu’ 


where  the  a^it|u  are  the  hyperparameters  of  the  Dirichlet  distribution  for  0X.|U, 
and  aXj|u  =  ’tk&xik |u-  1°  Gaussian  networks,  we  might  use  a  Wishart  pri¬ 
or  (Heckerman  and  Geiger,  1995).  For  our  purpose,  we  need  only  require  that 
the  prior  satisfies  two  basic  assumptions,  as  presented  by  Heckerman  et  al. 
(1995): 


—  Global  parameter  independence:  Let  0x,|PaG(Xi)  be  the  parameters  spec¬ 
ifying  the  behavior  of  the  variable  X,-  given  the  various  instantiations  to 
its  parents.  Then  we  require  that 

A0G|G)  =  n^(0xi|PaG(x()|G)  (3) 

i 

—  Parameter  modularity:  Let  G  and  G'  be  two  graphs  in  which  Pa<j (X,)  = 
PaG/(X,)=Uthen 

P(0Xf|U|G)  =  P(0Xj|U|G')  (4) 

Once  we  define  the  prior,  we  can  examine  the  form  of  the  posterior  prob¬ 
ability.  Using  Bayes  rule,  we  have  that 

P(G|D)«*P(D|G)P(G). 

The  term  P(D  |  G)  is  the  marginal  likelihood  of  the  data  given  G,  and  is 
defined  the  integration  over  all  possible  parameter  values  for  G. 

P{D  |  G)  =  jp(D\  G,0g)P(0g  I  G)dQ0 

The  term  P(D  \  G,0g)  is  simply  the  probability  of  the  data  given  a  specific 
Bayesian  network.  When  the  data  is  complete,  this  term  is  simply  a  product 
of  conditional  probabilities. 

Using  the  above  assumptions,  one  can  show  (see  (Heckerman  et  al.,  1995)): 

THEOREM  2.1.:  IfD  is  complete  and  P{G)  satisfies  parameter  independence 
and  parameter  modularity ,  then 

P(D  I  G)=Y\  f  riP(*M  I  PaG  )  [m]  >  0X,- 1  PaG  (Xj ) )  F  (  0X;  |  Pac  (X,- ) )  ^0X;  |  PaG  (X,)  • 

i  J  m 
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If  the  prior  also  satisfies  structure  modularity,  we  can  also  conclude  that 
posterior  probability  decomposes: 

P(G  |  D)  oc  P(D  |  G)P(G)  =  l\score{XhP^(Xi)  \  D) 


where 


score 


(X,-,U|D)  =  P(X,-,U)  [ l»M>^1|u)P(0x,|u)^eXj|U 

^  tn 


2.2.  Bayesian  model  averaging 


Recall  that  our  goal  is  to  compute  the  posterior  probability  of  some  feature 
f(G)  over  all  possible  graphs  G.  This  is  equal  to: 


P(f\D )  =  £/(G)P(G|D) 

G 


The  problem,  of  course,  is  that  the  number  of  possible  BN  structures  is  super¬ 
exponential:  2°^n\  where  n  is  the  number  of  variables. 

We  can  reduce  this  number  by  restricting  attention  to  structures  G  where 
there  is  a  bound  k  on  the  number  of  parents  per  node.  This  assumption,  which 
we  will  make  throughout  this  paper,  is  a  fairly  innocuous  one.  There  are  few 
applications  in  which  very  large  families  are  called  for,  and  there  is  rarely 
enough  data  to  support  robust  parameter  estimation  for  such  families.  From  a 
more  formal  perspective,  networks  with  very  large  families  tend  to  have  low 
score.  Let  Qu  be  the  set  of  all  graphs  with  indegree  bounded  by  k.  Note  that 
the  number  of  structures  in  Qk  is  still  super-exponential:  around  20(*"log"). 

Thus,  exhaustive  enumeration  over  the  set  of  possible  BN  structures  is 
feasible  only  for  tiny  domains  (4-5  nodes).  One  solution,  proposed  by  several 
researchers  (Madigan  and  Raftery,  1994;  Madigan  and  York,  1995;  Hecker- 
man  et  al.,  1997),  is  to  approximate  this  exhaustive  enumeration  by  finding  a 
set  Q  of  high  scoring  structures,  and  then  estimating  the  relative  mass  of  the 
structures  in  Q  that  contains  /: 


lcegP{G\D)fiG) 

2aegP(G\D ) 


(5) 


This  approach  leaves  open  the  question  of  how  we  construct  Q.  The  sim¬ 
plest  approach  is  to  use  model  selection  to  pick  a  single  high-scoring  struc¬ 
ture,  and  then  use  that  as  our  approximation.  If  the  amount  of  data  is  large 
relative  to  the  size  of  the  model,  then  the  posterior  will  be  sharply  peaked 
around  a  single  model,  and  this  approximation  is  a  reasonable  one.  However, 
as  we  discussed  in  the  introduction,  there  are  many  interesting  domains  (e.g.. 
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our  biological  application)  where  the  amount  of  data  is  small  relative  to  the 
size  of  the  model.  In  this  case,  there  is  usually  a  large  number  of  high-scoring 
models,  so  using  a  single  model  as  our  set  £7  is  a  very  poor  approximation. 

A  simple  approach  to  finding  a  larger  set  is  to  record  all  the  structures 
examined  during  the  search,  and  return  the  high  scoring  ones.  However,  the 
set  of  structures  found  in  this  manner  is  quite  sensitive  to  the  search  procedure 
we  use.  For  example,  if  we  use  greedy  hill-climbing,  then  the  set  of  structures 
we  will  collect  will  all  be  quite  similar.  Such  a  restricted  set  of  candidates 
also  show  up  when  we  consider  multiple  restarts  of  greedy  hill-climbing 
and  beam-search.  This  is  a  serious  problem  since  we  run  the  risk  of  getting 
estimates  of  confidence  that  are  based  on  a  biased  sample  of  structures. 

Madigan  and  Raftery  (1994)  propose  an  alternative  approach  called  Oc¬ 
cam’s  window,  which  rejects  models  whose  posterior  probability  is  very  low, 
as  well  as  complex  models  whose  posterior  probability  is  not  substantially 
better  than  a  simpler  model  (one  that  contains  a  subset  of  the  edges).  These 
two  principles  allow  them  to  prune  the  space  of  models  considered,  often  to 
a  number  small  enough  to  be  exhaustively  enumerated.  Madigan  and  Raftery 
also  provide  a  search  procedure  for  finding  these  models. 

An  alternative  approach,  proposed  by  Madigan  and  York  (1995),  is  based 
on  the  use  of  Markov  chain  Monte  Carlo  (MCMC)  simulation.  In  this  case,  we 
define  a  Markov  Chain  over  the  space  of  possible  structures,  whose  stationary 
distribution  is  the  posterior  distribution  P(G  \  D).  We  then  generate  a  set  of 
possible  structures  by  doing  a  random  walk  in  this  Markov  chain.  Assuming 
that  we  continue  this  process  until  the  chain  mixes,  we  can  hope  to  get  a  set 
of  structures  that  is  representative  of  the  posterior.  Related  approaches  have 
also  been  adopted  by  other  researchers.  Giudici  and  Green  (1999)  and  Giudici 
et  al.  (2000)  propose  an  MCMC  approach  over  junction  trees  —  undirected 
graphical  models  that  are  decomposable,  i.e.,  where  graph  is  triangulated. 
Green  (1995)  and  Giudici  et  al.  (2000)  also  extend  the  MCMC  methodology 
to  cases  where  closed-form  integration  over  parameters  is  infeasible,  by  defin¬ 
ing  a  reversible  jump  Markov  Chain  that  traverses  the  space  of  parameters 
as  well  as  structure.  Madigan  et  al.  (1996)  provide  an  approach  for  MCM- 
C  sampling  over  the  space  of  PDAGs  —  equivalence  classes  over  network 
structures. 

These  MCMC  solutions  are  the  only  approach  that  can,  in  principle,  ap¬ 
proximate  true  Bayesian  model  averaging  by  sampling  from  the  posterior 
over  network  structures.  They  have  been  demonstrated  with  success  on  a 
variety  of  small  domains,  typically  with  4—14  variables.  However,  there  are 
several  issues  that  potentially  limit  its  effectiveness  for  large  domains  involv¬ 
ing  many  variables.  As  we  discussed,  the  space  of  network  structures  grows 
superexponentially  with  the  number  of  variables.  Therefore,  the  domain  of 
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the  MCMC  traversal  is  enormous  for  all  but  the  tiniest  domains.1  More  im¬ 
portantly,  the  posterior  distribution  over  structures  is  often  quite  peaked,  with 
neighboring  structures  having  very  different  scores.  The  reason  is  that  even 
small  perturbations  to  the  structure  —  a  removal  of  a  single  edge  —  can 
cause  a  huge  reduction  in  score.  Thus,  the  “posterior  landscape”  can  be  quite 
jagged,  with  high  “peaks”  separated  by  low  “valleys”.  In  such  situations, 
MCMC  is  known  to  be  slow  to  mix,  requiring  many  samples  to  reach  the  pos¬ 
terior  distribution.  In  Section  5  we  provide  experimental  evidence  indicating 
that  these  difficulties  do,  indeed,  arise  in  practice. 


3.  Closed  form  for  known  order 

In  this  section,  we  temporarily  turn  our  attention  to  a  somewhat  easier  prob¬ 
lem.  Rather  than  perform  model  averaging  over  the  space  of  all  structures,  we 
restrict  attention  to  structures  that  are  consistent  with  some  known  total  order 
In  other  words,  we  restrict  attention  to  structures  G  where  if  X,-  €  Pa <j(Xy) 
then  i  -<  j.  This  assumption  was  a  standard  one  in  the  early  work  on  learning 
Bayesian  networks  from  data  (Cooper  and  Herskovits,  1992). 

3.1.  Computing  marginal  likelihood 

We  first  consider  the  problem  of  computing  the  probability  of  the  data  given 
the  order: 

P(DH)=  X  P(G\<)P(D\G)  (6) 

Ge  Qk 

Note  that  this  summation,  although  restricted  to  networks  with  bounded  in¬ 
degree  and  consistent  with  -<,  is  still  exponentially  large:  the  number  of  such 
structures  is  still  at  least  2*"logn. 

The  key  insight  is  that,  when  we  restrict  attention  to  structures  consistent 
with  a  given  order  the  choice  of  family  for  one  node  places  no  additional 
constraints  on  the  choice  of  family  for  another.  Note  that  this  property  does 
not  hold  without  the  restriction  on  the  order;  for  example,  if  we  pick  X;  to  be 
a  parent  of  Xj,  then  Xj  cannot  in  turn  be  a  parent  of  X/. 

Therefore,  we  can  choose  a  structure  G  consistent  with  -<  by  choosing, 
independently,  a  family  U  for  each  node  X,-.  The  parameter  modularity  as¬ 
sumption  in  Eq.  (4)  states  that  the  choice  of  parameters  for  the  family  of  X,-  is 
independent  of  the  choice  of  family  for  another  family  in  the  network.  Hence, 
summing  over  possible  graphs  consistent  with  X  is  equivalent  to  summing 
over  possible  choices  of  family  for  each  node,  each  with  its  parameter  prior. 

1  For  the  experiments  done  so  far,  the  larger  domains  (those  with  more  than  7-8  variables) 
were  typically  associated  with  a  large  set  of  structural  constraints  limiting  the  set  of  possible 
structures. 
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*  i 

Given  our  constraint  on  the  size  of  the  family,  the  possible  parent  sets  for  the 
node  X,-  is 

«*,-{  =  {U  :  U  ^X,-,|U|  <  k}. 

where  U  -<  X,-  is  defined  to  hold  when  all  nodes  in  U  precede  X,  in  Let 
be  the  set  of  structures  in  Qk  consistent  with  Given  that,  we  have 

P(D  H)  =  £  Uscore(X»PaG(Xi)\D) 

GeQk,^  * 

=  n  E  score{Xi,  U  |  D) .  (7) 

i 

Intuitively,  the  equality  states  that  we  can  sum  over  all  networks  consistent 
with  -<  by  summing  over  the  set  of  possible  families  for  each  node,  and  then 
multiplying  the  results  for  the  different  nodes.  This  transformation  allows 
us  to  compute  P(D  |-<)  very  efficiently.  The  expression  on  the  right-hand 
side  consists  of  a  product  with  a  term  for  each  node  X,-,  each  of  which  is 
a  summation  over  all  possible  families  for  X,.  Given  the  bound  k  over  the 
number  of  parents,  the  number  of  possible  families  for  a  node  X,-  is  at  most 
Q  <  nk.  Hence,  the  total  cost  of  computing  Eq.  (7)  is  at  most  n-nk  =  nk+l. 

We  note  that  the  decomposition  of  Eq.  (7)  was  first  mentioned  by  Buntine 
(1991),  but  the  ramifications  for  Bayesian  model  averaging  were  not  pursued. 
The  concept  of  Bayesian  model  averaging  using  a  closed-form  summation 
over  an  exponentially  large  set  of  structures  was  proposed  (in  a  different 
setting)  by  Pereira  and  Singer  (1999). 

The  computation  of  P(D  |-<)  is  useful  in  and  of  itself;  as  we  show  in  the 
next  section,  computing  the  probability  P(D  |  — <)  is  a  key  step  in  our  MCMC 
algorithm. 


3.2.  Probabilities  of  features 


For  certain  types  of  features  /,  we  can  use  the  technique  of  the  previous 
section  to  compute,  in  closed  form,  the  probability  P(/  |-<,D)  that  /  holds  in 
a  structure  given  the  order  and  the  data. 

In  general,  if  /(•)  is  a  feature.  We  want  to  compute 


We  have  just  shown  how  to  compute  the  denominator.  The  numerator  is  a  sum 
over  all  structures  that  contain  the  feature  and  are  consistent  with  the  order: 


P(f,D\<)=  £  f(G)P(G\-<)P(D\G)  (8) 


The  computation  of  this  term  depends  on  the  specific  type  of  feature  /. 
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The  simplest  situation  is  when  we  want  to  compute  the  posterior  probabil¬ 
ity  of  a  particular  choice  of  parents  U.  This  in  effect  require  us  to  sum  over 
all  graphs  where  Pa^X,-)  =  U.  In  this  case,  we  can  apply  the  same  closed 
form  analysis  to  (8).  The  only  difference  is  that  we  restrict  to  be  the 
singleton  {U}.  Since  the  terms  that  sum  over  the  parents  of  X*  for  k  ^  j  are 
not  disturbed  by  this  constraint,  they  cancel  out  from  the  equation. 


PROPOSITION  3.1.: 


P(PaG(XI)=U|D,-:)  = 


score  (Xi,U  |  D) 
Eu'e^score(XI-,U'|D)- 


(9) 


A  slightly  more  complex  situation  is  when  we  want  to  compute  the  pos¬ 
terior  probability  of  the  feature  fx^Xp  denoting  an  edge  X,-  — >  Xj.  Again,  we 
can  apply  the  same  closed  form  analysis  to  (8).  The  only  difference  is  that  we 
restrict  <Uj^  to  consist  only  of  subsets  that  contain  X,-. 

PROPOSITION  3.2.: 


P(fxi->xj  H,£>)  = 


X{Ue?4x :  uaX;}  score(X;,U  |  D) 
Sue x  score(X,-,U  |  D) 


A  somewhat  more  subtle  computation  is  required  to  compute  the  posterior 
of  /  *  ,  the  feature  that  denotes  that  X,-  is  in  the  Markov  blanket  of  X/, 

Xr-Xj 

which  holds  if  G  contains  the  edge  X,  ->  Xj,  or  the  edge  Xj  — >■  X„  or  there  is 
a  variable  X*  such  that  both  edges  X,-  ->  X*  and  Xj  — ►  X*  are  in  G. 

Assume,  without  loss  of  generality,  that  X,-  precedes  Xj  in  the  order.  In  this 
case,  X,-  can  be  in  X/s  Markov  blanket  either  if  there  is  an  edge  from  X,-  to  Xj, 
or  if  X;  and  Xj  are  both  parents  of  some  third  node  X/.  We  have  just  shown  how 
the  first  of  these  probabilities  P(/x,_»x/  I  A  -<),  can  be  computed  in  closed 
form.  We  can  also  easily  compute  the  probability  P(X,,X;  e  Pa<;(X/)  |  D,  -<) 
that  both  X,-  and  Xj  are  parents  of  X/:  we  simply  restrict  Zli  ^  to  families  that 
contain  both  X,  and  Xj.  The  key  is  to  note  that  as  the  choice  of  families  of 
different  nodes  are  independent,  these  are  all  independent  events.  Hence,  X,- 
and  Xj  are  not  in  the  same  Markov  blanket  only  if  all  of  these  events  fail  to 
occur.  Thus, 


PROPOSITION  3.3.: 


PVxfixjlD'^=. 

1  -  (1  - Pifx^Xj  |D,-0)*  n  0- p(Xi,Xj  e  Pac(X/)  I z>, -0) 

XtyXj 

Unfortunately,  this  approach  cannot  be  used  to  compute  the  probability 
of  arbitrary  structural  features.  For  example,  we  cannot  compute  the  proba¬ 
bility  that  there  exists  some  directed  path  from  X,-  to  Xj,  as  we  would  have 
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to  consider  all  possible  ways  in  which  this  path  could  manifest  through  our 
exponentially  many  structures. 

We  can  overcome  this  difficulty  using  a  simple  sampling  approach.  Eq.  (9) 
provides  us  with  a  closed  form  expression  for  the  exact  posterior  probability 
of  the  different  possible  families  of  the  node  Xj.  We  can  therefore  easily  sam¬ 
ple  entire  networks  from  the  posterior  distribution  given  the  order:  we  simply 
sample  a  family  for  each  node,  according  to  the  distribution  in  Eq.  (9).  We  can 
then  use  the  sampled  networks  to  evaluate  any  feature,  such  as  the  existence 
of  a  causal  path  from  X{  to  Xj. 


4.  MCMC  methods 

In  the  previous  section,  we  made  the  simplifying  assumption  that  we  were 
given  a  predetermined  order.  Although  this  assumption  might  be  reasonable 
in  certain  cases,  it  is  clearly  too  restrictive  in  domains  where  we  have  very 
little  prior  knowledge  (e.g.,  our  biology  domain).  We  therefore  want  to  con¬ 
sider  structures  consistent  with  all  n\  possible  orders  over  BN  nodes.  Here, 
unfortunately,  we  have  no  elegant  tricks  that  allow  a  closed  form  solution. 
Therefore,  we  provide  a  solution  which  uses  our  closed  form  solution  of  E- 
q.  (7)  as  a  subroutine  in  a  Markov  Chain  Monte  Carlo  algorithm  (Metropolis 
et  al.,  1953).  This  hybrid  algorithm  is  a  form  of  Rao-Blackwellized  Monte 
Carlo  sampling  algorithm  (Casella  and  Robert,  1996).  Related  approaches, 
called  mixture  estimators  were  proposed  and  analyzed  by  Gelfand  and  Smith 
(1990)  and  by  Liu  et  al.  (1994)  (see  discussion  below).  This  approach  is 
somewhat  related  to  the  work  of  Larranaga  et  al.  (1996),  which  proposes 
the  use  of  a  genetic  algorithm  to  search  for  a  high-scoring  order;  there,  how¬ 
ever,  the  score  of  an  order  is  the  score  of  a  single  high-scoring  structure  (as 
found  by  the  K2  algorithm  of  Cooper  and  Herskovits  (1992)),  and  the  overall 
purpose  is  model  selection  rather  than  model  averaging.  Furthermore,  genetic 
algorithms,  unlike  MCMC,  are  not  guaranteed  to  generate  samples  from  the 
posterior  distribution. 

4.1.  The  basic  algorithm 

We  introduce  a  uniform  prior  over  orders  -<,  and  define  P(G  |-<)  to  be  of 
the  same  nature  as  the  priors  we  used  in  the  previous  section.  It  is  important 
to  note  that  the  resulting  prior  over  structures  has  a  different  form  than  our 
original  prior  over  structures.  For  example,  if  we  define  P(G  |-<)  to  be  uni¬ 
form,  we  have  that  P(G)  is  not  uniform:  graphs  that  are  consistent  with  more 
orders  are  more  likely.  For  example,  a  Naive  Bayes  graph  is  consistent  with 
(n  —  1)!  orders,  whereas  any  chain-structured  graph  is  consistent  with  only 
one.  As  one  consequence,  our  induced  structure  distribution  is  not  hypothesis 
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equivalent  (Heckerman  et  al.,  1995),  in  that  different  network  structures  that 
are  in  the  same  equivalence  class  often  have  different  priors.  For  example,  the 
chain  X  Y  -»  Z  is  associated  with  a  unique  order,  whereas  the  equivalent 
structure  X  4-  Y  -»  Z  is  associated  with  two  orders,  and  is  therefore  twice  as 
likely  a  priori.  However,  as  Heckerman  et  al.  observe,  hypothesis  equivalence 
is  often  too  strong  an  assumption  (e.g.,  in  causal  settings).  They  propose 
likelihood  equivalence  as  a  substitute,  a  property  which  clearly  holds  in  our 
setting. 

In  general,  while  this  discrepancy  in  priors  is  unfortunate,  it  is  important 
to  see  it  in  proportion.  The  standard  priors  over  network  structures  are  often 
used  not  because  they  are  particularly  well-motivated,  but  rather  because  they 
are  simple  and  easy  to  work  with.  In  fact,  the  ubiquitous  uniform  prior  over 
structures  is  far  from  uniform  over  PDAGs  (Markov  equivalence  classes)  — 
PDAGs  consistent  with  more  structures  have  a  higher  induced  prior  probabil¬ 
ity.  One  can  argue  that,  for  causal  discovery,  a  uniform  prior  over  PDAGs  is 
more  appropriate;  nevertheless,  a  uniform  prior  over  networks  is  most  often 
used  for  practical  reasons.  Finally,  the  prior  induced  over  our  networks  does 
have  some  justification:  one  can  argue  that  a  structure  which  is  consistent  with 
more  orders  makes  fewer  assumptions  about  causal  ordering,  and  is  therefore 
more  likely  a  priori  (e.g.,  (Wallace  et  al.,  1996)). 

We  now  construct  a  Markov  chain  5Vf,  with  state  space  O  consisting  of 
all  n\  orders  -<;  our  construction  will  guarantee  that  fM  has  the  stationary 
distribution  P(-<|  D).  We  can  then  simulate  this  Markov  chain,  obtaining  a 
sequence  of  samples  -<i, . . . ,  <t-  We  can  now  approximate  the  expected  value 
of  any  function  g(-<)  as: 

r=l 

Specifically,  we  can  let  g(<)  be  P(f  |-<,D)  for  some  feature  (edge)  /.  We  can 
then  compute  g(^t)  =  P(f  |^r,D),  as  described  in  the  previous  section. 

It  remains  only  to  discuss  the  construction  of  the  Markov  chain.  We  use  a 
standard  Metropolis  algorithm  (Metropolis  et  al.,  1953).  We  need  to  guarantee 
two  things: 

—  that  the  chain  is  reversible,  i.e.,  P(-<  t->  -<’)  =  P(-<'  -<); 

-  that  the  stationary  distribution  of  the  chain  is  the  desired  posterior  dis¬ 
tribution  P(-<\  D). 

We  accomplish  this  goal  using  a  standard  Metropolis  sampling.  For  each 
order  we  define  a  proposal  probability  <?(  V|~<),  which  defines  the  proba¬ 
bility  that  the  algorithm  will  “propose”  a  move  from  -<  to  The  algorithm 
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then  accepts  this  move  with  probability 


min 


pKiD)gHK)| 


It  is  well  known  that  the  resulting  chain  is  reversible  and  has  the  desired 
stationary  distribution  (Gilks  et  al.,  1996). 

We  consider  several  specific  constructions  for  the  proposal  distribution, 
based  on  different  neighborhoods  in  the  space  of  orders.  In  one  very  sim¬ 
ple  construction,  we  consider  only  operators  that  flip  two  nodes  in  the  order 
(leaving  all  others  unchanged): 


(4 . . .  iy . . .  4 . . .  in)  (4 . . .  4  •  •  •  ij  •  •  •  hi )  ■ 


Another  operator  is  “cutting  the  deck”  in  the  order: 


(4  •  •  •  ijij+i  ■  ■  ■  in)  (4+i  ■  •  •  ink  ■  •  •  ij)- 

In  both  cases,  all  possible  operators  are  proposed  with  equal  probability.  We 
note  that,  in  this  case,  the  proposal  probabilities  and  p(-<|— <')  are 

always  equal,  so  the  associated  term  cancels  out  in  the  acceptance  probability. 

We  note  that  these  two  types  of  operators  are  qualitatively  very  different. 
The  “flip”  operator  takes  much  smaller  steps  in  the  space,  and  is  therefore 
likely  to  mix  much  more  slowly.  However,  any  single  step  is  substantially 
more  efficient  to  compute  (see  below).  In  our  implementation,  we  choose 
a  flip  operator  with  some  probability  p,  and  a  cut  operator  with  probability 
1  -  p.  We  then  pick  each  of  the  possible  instantiations  uniformly  (i.e.,  given 
that  we  have  decided  to  cut,  all  n  positions  are  equally  likely). 


4.2.  Computational  tricks 


Although  the  computation  of  the  marginal  likelihood  is  polynomial  in  n,  it 
can  still  be  quite  expensive,  especially  for  large  networks  and  reasonable  size 
k.  We  utilize  several  computational  tricks  for  reducing  the  complexity  of  this 
computation. 

First,  for  each  node  Xj,  we  restrict  attention  to  at  most  mp  other  nodes  as 
possible  parents  (for  some  fixed  mp).  We  select  these  mp  nodes  in  advance, 
before  any  MCMC  step,  as  follows:  for  each  potential  parent  Xj,  we  compute 
the  score  of  the  single  edge  Xj  X,-;  we  then  select  the  mp  nodes  Xj  for  which 
this  score  was  highest. 

Second,  for  each  node  Xj,  we  precompute  the  score  for  some  number  mp 
of  the  highest-scoring  families.  Again,  this  procedure  is  executed  once,  at  the 
very  beginning  of  the  process.  The  list  of  highest-scoring  families  is  sorted 
in  decreasing  order;  let  ^  be  the  score  of  the  worst  family  in  X;’s  list.  As 
we  consider  a  particular  order,  we  extract  from  the  list  all  families  consistent 
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with  that  order.  We  know  that  all  families  not  in  the  list  score  no  better  than 
£{.  Thus,  if  the  best  family  extracted  from  the  list  is  some  factor  y  better  than 
ii,  we  choose  to  restrict  attention  to  the  families  extracted  from  the  list,  under 
the  assumption  that  other  families  will  have  negligible  effect  relative  to  these 
high-scoring  families.  If  the  best  family  extracted  is  not  that  good,  we  do  a 
full  enumeration. 

When  performing  exhaustive  enumeration,  we  prune  families  that  aug¬ 
ment  low-scoring  families  with  low-scoring  edges.  Specifically,  assume  that 
for  some  family  U,  we  have  that  score  (Xj,  U  |  D)  is  substantially  lower  than 
other  families  enumerated  so  far.  In  this  case,  families  that  extend  U  are 
likely  to  be  even  worse.  More  precisely,  we  define  the  incremental  value 
of  a  parent  Y  for  X,-  to  be  its  added  value  as  a  single  parent:  A(Y;Xj)  = 
score(Xi,Y )  -  score(Xi).  If  we  now  have  a  family  U  such  that,  for  all  other 
possible  parents  Y,  score(Xi,V) + A(T;X,)  is  lower  than  the  best  family  found 
so  far  for  X;,  we  prune  all  extensions  of  U. 

Finally,  we  note  that  when  we  take  a  single  MCMC  step  in  the  space,  we 
can  often  preserve  much  of  our  computation.  In  particular,  let  -<  be  an  order 
and  let  ■<'  be  the  order  obtained  by  flipping  i j  and  4-  Now,  consider  the  terms 
in  Eq.  (7);  those  terms  corresponding  to  nodes  4  in  the  order  -<  that  precede 
ij  or  succeed  4  do  not  change,  as  the  set  of  potential  parent  sets  *11^  is  the 
same.  Furthermore,  the  terms  for  i/  that  are  between  ij  and  4  also  have  a  lot 
in  common  —  all  parent  sets  U  that  contain  neither  i j  nor  4  remain  the  same. 
Thus,  we  only  need  to  subtract 

]T  score (Xi,V  \  D) 

{Uetii,.:  :  VBXh} 

and  add 

^  score (Xi,  U  \  D). 

{Ue7i,  v(  :  U3XfJ 

By  contrast,  the  “cut”  operator  requires  that  we  recompute  the  entire  summa¬ 
tion  over  families  for  each  variable  X,-. 


5.  Experimental  Results 

Evaluating  the  Sampling  Process  Our  first  goal  is  to  evaluate  the  extent  to 
which  the  sampling  process  reflects  the  result  of  true  Bayesian  model  aver¬ 
aging.  We  first  compared  the  estimates  made  by  our  MCMC  sampling  over 
orders  to  estimates  given  by  the  full  Bayesian  averaging  over  networks.  We 
experimented  on  the  nine- variable  Flare  dataset.  We  ran  the  MCMC  sampler 
with  a  bum-in  period  of  1,000  steps  and  then  sampled  every  100  steps;  we 
experimented  with  collecting  5,  20,  and  50  samples.  (We  note  that  these  pa¬ 
rameters  are  probably  excessive,  but  they  ensure  that  we  are  sampling  very 
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Markov  Edges 

Figure  L  Comparison  of  posterior  probabilities  using  true  posterior  over  orders  (x-axis)  ver¬ 
sus  order-MCMC  (y-axis).  The  figures  show  Markov  features  and  Edge  features  in  the  Hare 
dataset  with  100  instances. 


close  to  the  stationary  probability  of  the  process.)  The  results  are  shown  in 
Figure  1.  As  we  can  see,  the  estimates  are  very  robust.  In  fact,  for  Markov 
features  even  a  sample  of  5  orders  gives  a  surprisingly  decent  estimate.  This 
is  due  to  the  fact  that  a  single  sample  of  an  order  contains  information  about 
exponentially  many  possible  structures.  For  edges  we  obviously  need  more 
samples,  as  edges  that  are  not  in  the  direction  of  the  order  necessarily  have 
probability  0.  With  20  and  50  samples  we  see  a  very  close  correlation  between 
the  MCMC  estimate  and  the  exact  computation  for  both  types  of  features. 


Mixing  rate  We  then  considered  larger  datasets,  where  exhaustive  enumer¬ 
ation  is  not  an  option.  For  this  purpose  we  used  synthetic  data  generated 
from  the  Alarm  BN  (Beinlich  et  al.,  1989),  a  network  with  37  nodes.  Here, 
our  computational  tricks  are  necessary.  We  used  the  following  settings:  k 
(max.  number  of  parents  in  a  family)  =  3;  nip  (max.  number  of  potential 
parents)  =  20;  mp  (number  of  families  cached)  =  4000;  and  y  (difference  in 
score  required  in  pruning)  =  10.  Note  that  y  =  10  corresponds  to  a  difference 
of  210  in  the  posterior  probability  of  the  families.  Different  families  have  huge 
differences  in  score,  so  a  difference  of  210  in  the  posterior  probability  is  not 
uncommon. 

Our  first  goal  was  the  comparison  of  the  mixing  rate  of  the  two  MCMC 
samplers.  For  structure-MCMC,  we  used  a  bum  in  of  100,000  iterations  and 
then  sampled  every  25,000  iterations.  For  order-MCMC,  we  used  a  bum  in 
of  10,000  iterations  and  then  sampled  every  2,500  iterations.  In  both  methods 
we  collected  a  total  of  50  samples  per  ran.  We  note  that,  computationally, 
structure-MCMC  is  faster  than  order-MCMC.  In  our  current  implementation. 
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Figure  2.  Plots  of  the  progression  of  the  MCMC  runs.  Each  graph  shows  plots  of  6  inde¬ 
pendent  runs  over  Alarm  with  either  100,  500,  and  1000  instances.  The  graph  plot  the  score 
(log2(P(D  |  G)P(G))  or  log2(P(Z>  | -<)/*(-<)))  of  the  “current”  candidate  (y-axis)  for  different 
iterations  (a- axis)  of  the  MCMC  sampler.  In  each  plot,  three  of  the  runs  are  seeded  with  the 
network  found  by  greedy  hill  climbing  search  over  network  structures.  The  other  three  runs 
are  seeded  either  by  the  empty  network  in  the  case  of  the  structure-MCMC  or  a  random  order 
in  the  case  of  order-MCMC. 


generating  a  successor  network  is  about  an  order  of  magnitude  faster  than 
generating  a  successor  order.  We  therefore  designed  the  runs  in  Figure  2  to 
take  roughly  the  same  amount  of  computation  time. 
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One  phenomenon  that  was  quite  clear  was  that  order-MCMC  runs  mix 
much  faster.  That  is,  after  a  small  number  of  iterations,  these  runs  reached  a 
“plateau”  where  successive  samples  had  comparable  scores.  Runs  started  in 
different  places  (including  random  order  and  orders  seeded  from  the  results 
of  a  greedy-search  model  selection)  rapidly  reached  the  same  plateau.  On  the 
other  hand,  MCMC  runs  over  network  structures  reached  very  different  levels 
of  scores,  even  though  they  were  run  for  a  much  larger  number  of  iterations. 
Figure  2  illustrates  this  phenomenon  for  examples  of  Alarm  with  100,  500, 
and  1000  instances.  Note  the  substantial  difference  in  the  scale  of  the  y-axis 
between  the  two  sets  of  graphs. 

In  the  case  of  100  instances,  both  MCMC  samplers  seemed  to  mix.  Structure- 
MCMC  mixes  after  about  20,000-30,000  iterations,  while  order-MCMC  mix¬ 
es  after  about  1,000-2,000  iterations.  On  the  other  hand,  when  we  examine 
500  samples,  order-MCMC  converges  to  a  high-scoring  plateau,  which  we 
believe  is  the  stationary  distribution,  within  10,000  iterations.  By  contrast, 
different  runs  of  the  structure-MCMC  stayed  in  very  different  regions  of  the 
in  the  first  500,000  iterations.  The  situation  is  even  worse  in  the  case  of  1,000 
instances.  In  this  case,  structure-MCMC  started  from  an  empty  network  does 
not  reach  the  level  of  score  achieved  by  the  runs  starting  from  the  structure 
found  by  greedy  hill  climbing  search.  Moreover,  these  latter  runs  seem  to 
fluctuate  around  the  score  of  the  initial  seed,  never  exploring  another  region 
of  the  space.  Note  that  different  runs  show  differences  of  100  -  500  bits. 
Thus,  the  sub-optimal  runs  sample  from  networks  that  are  at  least  2100  less 
probable! 

Effects  of  Mixing  This  phenomenon  has  two  explanations.  Either  the  seed 
structure  is  the  global  optimum  and  the  sampler  is  sampling  from  the  pos¬ 
terior  distribution,  which  is  “centered”  around  the  optimum;  or  the  sampler 
is  stuck  in  a  local  “hill”  in  the  space  of  structures  from  which  it  cannot  es¬ 
cape.  This  latter  hypothesis  is  supported  by  the  fact  that  runs  starting  at  other 
structures  (e.g.,  the  empty  network)  take  a  very  long  time  to  reach  similar 
level  of  scores,  indicating  that  there  is  a  very  different  part  of  the  space  on 
which  stationary  behavior  is  reached.  We  now  provide  further  support  for 
this  second  hypothesis. 

We  first  examine  the  posterior  computed  for  different  features  in  different 
runs.  Figure  3  compares  the  posterior  probability  of  Markov  features  assigned 
by  different  runs  of  structure-MCMC.  Let  us  first  consider  the  runs  over  500 
instances.  Here,  although  different  runs  give  a  similar  probability  estimate 
to  most  structural  features,  there  are  several  features  on  which  they  differ 
radically.  In  particular,  there  are  features  that  are  assigned  probability  close 
to  1  by  structures  sampled  from  one  run  and  probability  close  to  0  by  those 
sampled  from  the  other.  While  this  behavior  is  less  common  in  the  runs  seed¬ 
ed  with  the  greedy  structure,  it  occurs  even  there.  This  suggests  that  each  of 
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Figure  3.  Scatter  plots  that  compare  posterior  probability  of  Markov  features  on  the  Alarm 
dataset,  as  determined  by  different  runs  of  structure-MCMC.  Each  point  corresponds  to  a 
single  Markov  feature;  its  x  and  y  coordinates  denote  the  posterior  estimated  by  the  two 
compared  runs.  The  position  of  points  is  slightly  randomly  perturbed  to  visualize  clusters 
of  points  in  the  same  position. 


these  runs  (even  runs  that  start  at  the  same  place)  gets  trapped  in  a  different 
local  neighborhood  in  the  structure  space.  Somewhat  surprisingly,  a  similar 
phenomenon  appears  to  occur  even  in  the  case  of  100  instances,  where  the 
runs  appeared  to  mix.  In  this  case,  the  overall  correlation  between  the  runs 
is,  as  we  might  expect,  weaker:  with  100  instances,  there  are  many  more 
high-scoring  structures  and  therefore  the  variance  of  the  sampling  process 
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Figure  4.  Scatter  plots  that  compare  posterior  probability  of  Markov  features  on  the  Alarm 
domain  as  determined  by  different  runs  of  order-MCMC.  Each  point  corresponds  to  a  single 
Markov  feature;  its  x  and  y  coordinates  denote  the  posterior  estimated  by  the  greedy  seeded 
run  and  a  random  seeded  run  respectively. 


is  higher.  However,  we  once  again  observe  features  which  have  probability 
close  to  0  in  one  ran  and  close  to  1  in  the  other.  These  discrepancies  are  not 
as  easily  explained  by  the  variance  of  the  sampling  process.  Therefore,  even 
for  100  instances,  it  is  not  clear  that  structure-MCMC  mixes. 

By  contrast,  comparison  of  the  predictions  of  different  runs  of  order-MCMC 
are  tightly  correlated.  Figure  4  compares  two  runs,  one  starting  from  an  order 
consistent  with  the  greedy  structure  and  the  other  from  a  random  order.  We 
can  see  that  the  predictions  are  very  similar,  both  for  the  small  dataset  and  the 
larger  one.  This  observation  reaffirms  our  claim  that  these  different  runs  are 
indeed  sampling  from  similar  distributions.  That  is,  they  are  sampling  from 
the  true  posterior. 

We  believe  that  the  difference  in  mixing  rate  is  due  to  the  smoother  poste¬ 
rior  landscape  of  the  space  of  orders.  In  the  space  of  networks,  even  a  small 
perturbation  to  a  network  can  lead  to  a  huge  difference  in  score.  By  contrast, 
the  score  of  an  order  is  a  lot  less  sensitive  to  slight  perturbations.  For  one, 
the  score  of  each  order  is  an  aggregate  of  the  scores  of  a  very  large  set 
of  structures;  hence,  differences  in  scores  of  individual  networks  can  often 
cancel  out.  Furthermore,  for  most  orders,  we  are  likely  to  find  a  consistent 
structure  which  is  not  too  bad  a  fit  to  the  data;  hence,  an  order  is  unlikely  to 
be  uniformly  horrible. 

The  disparity  in  mixing  rates  is  more  pronounced  for  larger  datasets.  The 
reason  is  quite  clear:  as  the  amount  of  data  grows,  the  posterior  landscape 
becomes  “sharper”  since  the  effect  of  a  single  change  in  the  structure  is 
amplified  across  many  samples.  As  we  discussed  above,  if  our  dataset  is 
large  enough,  model  selection  is  often  a  good  approximation  to  model  av¬ 
eraging.  However,  it  is  important  to  note  that  500  instances  for  Alarm  are  not 
enough  to  peak  the  posterior  sharply  enough  that  model  selection  is  a  reliable 
approach  to  discovering  structure.  We  can  see  that  by  examining  the  poste- 
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Figure  5.  Scatter  plots  that  compare  posterior  probability  of  Markov  features  on  the  Alarm 
domain  as  determined  by  the  two  different  MCMC  samplers.  Each  point  corresponds  to  a 
single  Markov  feature;  its  x  and  y  coordinates  denote  the  posterior  estimated  by  the  greedy 
seeded  run  of  order-MCMC  and  structure-MCMC,  respectively. 


Structure 


Order 


Figure  6.  Plots  of  the  progression  of  the  MCMC  runs  on  the  Boston-housing  data  set.  Each 
graph  shows  plots  of  4  independent  runs.  All  the  nms  are  seeded  with  the  network  found  by 
searching  over  network  structures. 


rior  probabilities  in  Figure  4.  We  see  that  the  posterior  probability  for  most 
Markov  features  is  fairly  far  from  0  or  1 .  As  Markov  features  are  invariant  for 
all  networks  in  the  same  Markov  equivalence  class  (PDAG),  this  phenomenon 
indicates  that  there  are  several  PDAGs  that  have  high  score  given  the  data.  By 
contrast,  in  the  case  of  1000  instances,  we  see  that  the  probability  of  almost 
all  features  is  clustered  around  0  or  1,  indicating  that  model  selection  is  likely 
to  return  a  fairly  representative  structure  in  this  case. 

A  second  form  of  support  for  the  non-mixing  conjecture  is  obtained  by 
considering  an  even  smaller  data  set:  the  Boston-housing  data  set,  from  the 
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Figure  7.  Scatter  plots  that  compare  posterior  probability  of  Markov  on  the  Boston-housing 
data  set,  as  determined  by  different  runs  of  structure-MCMC  and  order-MCMC. 
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Figure  8.  Scatter  plots  that  compare  posterior  probability  of  Markov  on  the  Boston-housing 
data  set,  as  determined  by  different  runs  of  structure-MCMC  and  order-MCMC  to  the  initial 
seed  of  the  MCMC  runs.  The  x-axis  denotes  whether  the  feature  appears  in  the  seed  network: 
1  if  it  appear  and  0  if  does  not.  The  y-axis  denote  the  estimate  of  the  posterior  probability  of 
the  feature  based  on  the  MCMC  sampling. 

UCI  repository  (Murphy  and  Aha,  1995),  is  a  continuous  domain  with  14 
variables  and  506  samples.  Here,  we  considered  linear  Gaussian  networks, 
and  used  a  standard  Wishart  parameter  prior.  We  started  the  structure-MCMC 
on  the  structure  obtained  from  greedy  hill-climbing  search.  We  started  the 
order-MCMC  on  an  order  consistent  with  that  structure.  As  usual,  as  shown 
in  Figure  6(a),  structure-MCMC  does  not  converge.  However,  as  shown  in 
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Figure  6(b),  the  runs  of  order-MCMC  are  also  somewhat  more  erratic,  indi¬ 
cating  a  more  jagged  posterior  landscape  even  over  orders.  In  a  way,  this  is  not 
surprising,  given  the  large  number  of  instances  and  small  domain.  In  Figure  7, 
we  see  that,  as  above,  different  runs  of  structure-MCMC  lead  to  very  different 
answers,  whereas  different  runs  of  order-MCMC  are  very  consistent. 

More  interesting  is  the  examination  of  the  feature  probabilities  themselves. 
Figure  8(a)  shows  a  comparison  between  the  feature  probabilities  of  structure- 
MCMC  and  those  of  the  structure  returned  by  greedy  search,  used  as  the 
starting  point  for  the  chain.  We  can  see  that  most  of  the  structures  traversed 
by  the  MCMC  search  are  very  similar  to  the  greedy  seed.  By  contrast.  Fig¬ 
ure  8(b)  show  that  order-MCMC  traverses  a  different  region  of  the  space, 
leading  to  very  different  estimates.  It  turns  out  that  the  structure  found  by 
the  greedy  search  is  suboptimal,  but  that  structure-MCMC  remains  stuck  in 
a  local  maximum  around  that  point.  By  contrast,  the  better  mixing  properties 
of  order-MCMC  allow  is  to  break  out  of  this  local  maximum,  and  to  reach 
a  substantially  higher-scoring  region.  Thus,  even  in  cases  where  there  is  a 
dominant  global  maximum,  order-MCMC  can  be  a  more  robust  approach 
than  greedy  hill-climbing,  structure-MCMC,  or  their  combination. 

Comparison  of  Estimates  We  now  compare  the  estimates  of  the  two  ap¬ 
proaches  on  the  Alarm  data  set.  We  deliberately  chose  to  use  the  smaller  data 
sets  for  two  reasons:  to  allow  structure-MCMC  a  better  chance  to  mix,  and 
to  highlight  the  differences  resulting  from  the  different  priors  used  in  the  two 
approaches.  The  results  are  shown  in  Figure  5.  We  see  that,  in  general,  the 
estimates  of  the  two  methods  are  not  too  far  apart,  although  the  posterior 
estimate  of  the  structure-MCMC  is  usually  larger. 

We  attribute  these  discrepancies  in  the  posterior  to  the  different  structure 
prior  we  employ  in  the  order-MCMC  sampler.  To  test  this  conjecture,  in  a  way 
that  decouples  it  from  the  effects  of  sampling,  we  chose  to  compare  the  exact 
posterior  computed  by  summing  over  all  orders  to  the  posterior  computed 
by  summing  over  all  equivalence  classes  of  Bayesian  networks  (PDAGs). 
(I.e.,  we  counted  only  a  single  representative  network  for  each  equivalence 
class.)  Of  course,  in  order  to  do  the  exact  Bayesian  computation  we  need 
to  do  an  exhaustive  enumeration  of  hypotheses.  For  orders,  this  enumeration 
is  possible  for  as  many  as  10  variables,  but  for  structures,  we  are  limited 
to  domains  with  5-6  variables.  We  took  two  data  sets  —  Vote  and  Flare  — 
from  the  UCI  repository  (Murphy  and  Aha,  1995)  and  selected  five  variables 
from  each.  We  generated  datasets  of  sizes  50  and  200,  and  computed  the  full 
Bayesian  averaging  posterior  for  these  datasets  using  both  methods.  Figure  9 
compares  the  results  for  both  datasets.  We  see  that  the  two  approaches  are 
well  correlated,  but  that  the  prior  does  have  some  effect. 

To  gain  a  better  understanding  of  the  general  effect  of  a  structure  prior,  we 
examined  the  sensitivity  of  Bayesian  model  averaging  to  changes  in  the  prior. 
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Figure  9.  Comparison  of  posterior  probabilities  for  different  Markov  features  between  full 
Bayesian  averaging  using:  orders  (x-axis)  versus  PDAGs  (y-axis)  for  two  UCI  datasets  (5 
variables  each). 


Recall  that  our  experiments  use  the  MDL  prior  shown  in  Eq.  (2),  whether 
for  P(G)  (in  structure-MCMC)  or  for  P(G  |-<)  (in  order-MCMC).  We  ran  the 
same  experiment,  raising  this  prior  to  some  power  —  0,  or  2.  Note  that  a 
power  of  0  corresponds  to  a  uniform  prior,  over  structures  in  the  structure- 
MCMC  case  and  over  structures  within  an  order  in  the  order-MCMC  case. 
By  contrast,  a  power  of  2  corresponds  to  an  even  more  extreme  penalty  for 
large  families.  Figure  10  shows  the  comparison  of  the  modified  priors  to 
the  “standard”  case.  As  we  can  expect,  a  stronger  structure  prior  results  in 
lower  posterior  for  features  while  a  uniform  structure  prior  is  more  prone 
to  adding  edges  and  thus  most  features  have  higher  posterior.  Thus,  we  see 
that  the  results  of  a  structure  discovery  algorithm  are  always  sensitive  to  the 
structure  prior,  and  that  even  two  very  reasonable  (and  common)  priors  can 
lead  to  very  different  results.  This  effect  is  at  least  as  large  as  the  effect  of 
using  our  order-based  structure  prior.  Given  that  the  choice  of  prior  is  often 
somewhat  arbitrary,  there  is  no  reason  to  assume  that  our  order-based  prior  is 
less  reasonable  than  any  other. 

Structure  Reconstruction  This  phenomenon  raises  an  obvious  question:  giv¬ 
en  that  the  approaches  give  different  results,  which  is  better  at  reconstructing 
features  of  the  generating  model.  To  test  this,  we  label  Markov  features  in 
the  Alarm  domain  as  positive  if  they  appear  in  the  generating  network  and 
negative  if  they  do  not.  We  then  use  our  posterior  to  try  and  distinguish  “true” 
features  from  “false”  ones:  we  pick  a  threshold  t,  and  predict  that  the  feature 
/  is  “true”  if  P(f)  >  t.  Clearly,  as  we  vary  the  the  value  of  t,  we  will  get 
different  sets  of  features.  At  each  threshold  value  we  can  have  two  types 
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Structure  vs.  Standard  Structure 


Order  vs.  Structure 


Figure  10.  Comparison  of  the  posterior  of  Markov  features  when  we  change  the  structure 
prior  strength  for  Alarm  with  100  instances.  The  top  row  compares  the  modified  prior  (y-axis) 
in  order-MCMC  against  the  standard  prior  (x-axis).  The  middle  row  makes  an  analogous 
comparison  for  structure-MCMC.  The  bottom  compares  the  modified  prior  with  order  (x-axis) 
against  the  modified  prior  with  structures  (y-axis).  Each  column  corresponds  to  a  different 
weighting  of  the  prior,  as  denoted  at  the  top  of  the  column. 


of  errors:  false  positives  —  positive  features  that  are  misclassified  as  nega¬ 
tive,  and  false  negatives  —  negative  features  that  are  classified  as  positive. 
Different  values  of  t  achieve  different  tradeoffs  between  these  two  type  of 
errors.  Thus,  for  each  method  we  can  plot  the  tradeoff  curve  between  the 
two  types  of  errors.  Note  that,  in  most  applications  of  structure  discovery, 
we  care  more  about  false  positives  than  about  false  negatives.  For  example, 
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Figure  12.  Plots  of  the  progression  of  the  MCMC  runs  on  the  Genetics  data  set.  Each  graph 
shows  plots  of  4  independent  runs.  All  the  runs  are  seeded  with  the  network  found  by  searching 
over  network  structures. 


We  computed  such  tradeoff  curves  for  Alarm  data  set  with  100  and  1000 
instances  for  two  types  of  features:  Markov  features  and  Path  features.  The 
latter  represent  relations  of  the  form  “there  is  a  directed  path  from  X  to  Y” 
in  the  PDAG  of  the  network  structure.  Directed  paths  in  the  PDAG  are  very 
meaningful:  if  we  assume  no  hidden  variables,  they  correspond  to  a  situation 
where  X  causes  Y.  As  discussed  in  Section  3,  we  cannot  provide  a  closed 
form  expression  for  the  posterior  of  such  a  feature  given  an  order.  However, 
we  can  sample  networks  from  the  order,  and  estimate  the  feature  relative  to 
those.  In  our  experiments,  we  sampled  10  networks  from  each  order. 

Figure  11  displays  tradeoff  curves  comparing  order-MCMC,  structure- 
MCMC,  and  the  non-parametric  Bootstrap  approach  of  Friedman  et  al.  (1999a), 
a  non-Bayesian  simulation  approach  to  estimate  “confidence”  in  features.  As 
we  can  see,  in  all  but  one  of  the  cases  (path  features  with  100  instances), 
order-MCMC  does  as  well  or  better  than  the  other  approaches,  with  marked 
gains  in  three  cases.  In  particular,  for  t  larger  than  0.4,  order-MCMC  makes 
no  false  positive  errors  for  Markov  features  on  the  1000-instance  data  set.  We 
believe  that  features  it  misses  are  due  to  weak  interactions  in  the  network  that 
cannot  be  reliably  learned  from  such  a  small  data  set. 

Application  to  Gene  Expression  Data  As  stated  in  the  introduction  our  goal 
is  to  apply  structure  estimation  methods  for  causal  learning  from  gene  expres¬ 
sion  data.  We  tested  our  method  on  a  relatively  small  genetic  data  set  of  Fried¬ 
man  et  al.  (2000).  This  data  set  is  derived  from  a  larger  data  set  of  S.  cerevisiae 
cell-cycle  measurements  reported  in  Spellman  et  al.  (1998).  The  data  set  con¬ 
tains  76  samples  of  250  genes.  Friedman  et  al.  discretized  each  measurement 
into  three  values  (“under-expressed”,  “normal”,  “over-expressed”). 

We  applied  the  order  based  MCMC  using  seeding  the  runs  with  ordering 
consistent  with  the  network  found  by  the  search  procedure  of  Friedman  et  al. 
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Figure  13.  Scatter  plots  that  compare  posterior  probability  of  Markov  and  path  features  on  the 
Genetics  data  set,  as  determined  by  different  runs  of  structure-MCMC. 


Markov  features  Path  features 


Figure  14.  Classification  tradeoff  curves  for  different  methods  on  the  simulated  Genetics  data 
set.  The  x-axis  and  the  y-axis  denote  false  positive  and  false  negative  errors,  respectively.  The 
curve  is  achieved  by  different  threshold  values  in  the  range  [0, 1]. . 


(1999b).  For  these  runs,  we  used :  k  (max.  number  of  parents  in  a  family)  =  3; 
mp  (max.  number  of  potential  parents)  =  45;  mp  (number  of  families  cached) 
=  4000;  and  y  (difference  in  score  required  in  pruning)  =  10.  We  used  a  bum- 
in  period  of  4000  iterations,  and  then  sampled  every  400  iterations  collecting 
50  samples  in  each  run. 

Figure  12  shows  the  progression  of  runs  of  the  two  MCMC  methods  on 
this  data.  As  we  can  see,  the  order  based  MCMC  sampler  mixes  rapidly  (af¬ 
ter  few  hundred  iterations).  On  the  other  hand,  the  structure  based  MCMC 
sampler  seems  to  be  mixing  only  after  200,000  iterations.  Figure  13  shows 
comparison  of  estimates  from  two  different  runs  of  the  order  based  MCMC 
sampler.  As  in  other  data  sets,  the  estimates  based  on  two  different  runs  are 
in  close  agreement. 
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Since  we  want  to  use  this  tool  for  scientific  discovery,  we  want  to  evaluate 
how  well  does  Bayesian  structure  estimation  performs  in  this  domain.  To 
do  so  we  performed  the  following  simulation  experiments.  We  sampled  100 
instances  from  the  network  found  by  structure  search  on  the  genetics  data.  We 
then  applied  the  order  based  MCMC  sampler  and  the  bootstrap  approach  and 
evaluated  the  success  in  reconstructing  features  of  the  generating  network. 
Figure  14  shows  the  tradeoff  between  the  two  types  of  errors  for  these  two 
methods  in  predicting  Markov  and  path  features  .  As  we  can  see,  the  order 
based  MCMC  sampler  clearly  outperforms  the  bootstrap. 

We  should  stress  that  the  simulation  is  based  on  a  network  that  is  probably 
simpler  than  the  underlying  structure  (since  we  learned  it  from  few  sam¬ 
ples).  Nonetheless,  we  view  these  results  as  an  indication  that  using  Bayesian 
estimates  is  more  reliable  in  this  domain. 


6.  Discussion  and  future  work 

We  have  presented  a  new  approach  for  estimating  the  posterior  distribution  of 
network  structures  given  a  data  set.  Our  approach  is  based  on  two  main  ideas. 
The  first  is  a  clean  and  computationally  tractable  expression  for  the  posterior 
of  the  data  given  a  known  order  over  network  variables.  The  second  is  Monte 
Carlo  sampling  algorithm  over  orders.  We  have  shown  that  this  approach 
mixes  substantially  faster  than  the  standard  MCMC  algorithm  that  samples 
structures  directly. 

Once  we  have  generated  a  set  of  orders  sampled  from  the  posterior  dis¬ 
tribution,  we  can  use  them  in  a  variety  of  ways.  As  we  have  shown,  we  can 
estimate  the  probabilities  of  certain  structural  features  —  edge  features  or 
adjacency  in  Markov  neighborhoods  —  directly  in  closed  form  for  a  giv¬ 
en  order.  For  other  structural  features,  we  can  estimate  their  probability  by 
sampling  network  structures  from  each  order,  and  testing  for  the  presence  or 
absence  of  the  feature  in  each  structure. 

We  have  shown  that  the  estimates  returned  by  our  algorithm,  using  either 
of  these  two  methods,  are  substantially  more  robust  than  those  obtained  from 
standard  MCMC  over  structures.  To  some  extent,  if  we  ignore  the  different 
prior  used  in  these  two  approaches,  this  phenomenon  is  due  to  the  fact  that 
mixture  estimators  have  lower  variance  than  estimators  based  on  individual 
samples  (Gelfand  and  Smith,  1990;  Liu  et  al.,  1994).  More  significantly,  how¬ 
ever,  we  see  that  the  results  of  MCMC  over  structures  are  substantially  less 
reliable,  as  they  are  highly  sensitive  to  the  region  of  the  space  to  which  the 
Markov  chain  process  happens  to  gravitate. 

We  have  also  tested  the  efficacy  of  our  algorithm  for  the  task  of  recov¬ 
ering  structural  features  which  we  know  are  present.  We  have  shown  that 
our  algorithm  is  always  more  reliable  at  recovering  features  than  MCMC 
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over  structures,  and  in  all  but  one  case  also  more  reliable  than  the  bootstrap 
approach  of  Friedman  et  al.  (1999a). 

We  believe  that  this  approach  can  be  extended  to  deal  with  data  sets  where 
some  of  the  data  is  missing,  by  extending  the  MCMC  over  orders  with  M- 
CMC  over  missing  values,  allowing  us  to  average  over  both.  If  successful, 
we  can  use  this  combined  MCMC  algorithm  for  doing  full  Bayesian  model 
averaging  for  prediction  tasks  as  well.  Finally,  we  plan  to  apply  this  algorithm 
in  our  biology  domain,  in  order  to  try  and  understand  the  underlying  structure 
of  gene  expression. 
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